Red Wine Quality analysis by Harish Garg

About

The dataset contains records of certain chemical properties of red wines and the quality rating assigned to each wine. We will try to discover relationships between these variables and the quality of the red wine.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Looking at the summary of the dataset, we can see that variables 2-12 are the chemical properties of the wines and variable 13 is the quality rating. The first variable, X, is an ID for the wine record, so we can safely ignore it in the analysis.

Univariate Plots Section

Plots

We start by plotting each variable on its own to look at its distribution and some summary statistics.

## [1] "Summary of Quality"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
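  • Wine quality is rated on a scale of 10. This is explained in the dataset description.
  • The quality distribution is slightly right-skewed: ratings 7 and 8 (217 wines) are more common than ratings 3 and 4 (63 wines), even though the mean (5.636) sits a little below the median (6).
  • Most wines fall into two categories, 5 and 6 (1319 of 1599 wines, about 82%).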
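
The share quoted above can be checked directly from the frequency table. The following is a minimal Python sketch, written independently of the original analysis (whose output looks like R console output); the counts are copied from the table above, and the variable names are illustrative.

```python
# Counts per quality rating, copied from the frequency table above.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(quality_counts.values())

# Share of wines rated 5 or 6.
share_5_6 = (quality_counts[5] + quality_counts[6]) / total

# Mean rating reconstructed from the counts.
mean_quality = sum(rating * n for rating, n in quality_counts.items()) / total

print(f"total wines: {total}")
print(f"share rated 5 or 6: {share_5_6:.1%}")
print(f"mean quality: {mean_quality:.3f}")
```

The reconstructed total and mean should agree with the 1599 observations and the mean of 5.636 reported in the summary.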

## [1] "Summary of Fixed Acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

fixed.acidity appears to have a right-skewed distribution, roughly Poisson-like in shape, with a high concentration around a fixed.acidity of ~8 (near the median).

## [1] "Summary of Volatile Acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

volatile.acidity appears to have a bimodal distribution (peaks near 0.4 and 0.6), with a few outliers at the higher values.

## [1] "Summary of Citric Acid"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

citric.acid also appears to have a right-skewed, roughly Poisson-like distribution. A few of the wines seem to have no citric acid.

## [1] "Summary of Residual Sugar"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

residual.sugar has a long tail on the positive side, with a high concentration around the median (~2.2) and quite a few outliers in the higher range.

## [1] "Summary of Chlorides"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

chlorides has a long tail on the positive side, with a high concentration around the median and quite a few outliers in the higher range.

## [1] "Summary of Free Sulfur Dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

free.sulfur.dioxide seems to peak around 7 and then tails off in a long right tail.

## [1] "Summary of Total Sulfur Dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

total.sulfur.dioxide seems to have a distribution similar to that of free.sulfur.dioxide.

## [1] "Summary of Density"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density appears to be approximately normally distributed.

## [1] "Summary of pH"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH appears to be approximately normally distributed.

## [1] "Summary of Sulphates"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

sulphates seems to have a long tail on the positive side.

## [1] "Summary of Alcohol"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

alcohol seems to have a distribution similar to that of the sulfur dioxide variables. There is a rapid rise in the number of wines around 9.5, followed by a long-tailed distribution.

Univariate Analysis

What is the structure of your dataset?

This dataset has 1599 observations and 13 variables. These 1599 observations correspond to 1599 types of red wines.

What is/are the main feature(s) of interest in your dataset?

  • “quality” is the dependent variable.
  • The remaining variables are treated as independent variables: the 11 chemical properties (X is only a record ID). We will look at how these independent variables relate to the dependent variable, i.e. quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Let’s begin by finding the correlation between each independent variable and the dependent variable.

##                    X        fixed.acidity     volatile.acidity 
##                0.066                0.124                0.391 
##          citric.acid       residual.sugar            chlorides 
##                0.226                0.014                0.129 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                0.051                0.185                0.175 
##                   pH            sulphates              quality 
##                0.058                0.251                1.000

The results suggest that none of the independent variables has a strong correlation with quality. (The values above are magnitudes only; the signed coefficients appear in the Bivariate Plots Section.) So, we would need to work with multiple independent variables to see if we get a stronger relationship with quality.
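
For reference, the coefficient being discussed is the Pearson correlation. Below is a minimal Python sketch of how it can be computed by hand; it is an illustration, not the code used to produce the output above. The function name `pearson` and the toy values are hypothetical and are not taken from the wine dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values, NOT taken from the wine dataset.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]
r = pearson(alcohol, quality)
print(round(r, 3))
```

The coefficient ranges from -1 to 1, so a value such as 0.476 indicates a moderate positive association rather than a strong one.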

Did you create any new variables from existing variables in the dataset?

Not yet. I may update this section if I do create more variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

We are going to treat the quality variable as a factor when plotting, as this will help us frame the analysis as a classification problem.
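
Treating quality as a factor amounts to grouping the records by rating and summarizing another variable within each group, which is what a boxplot per rating displays. The following Python sketch illustrates the idea under stated assumptions: the `records` values are hypothetical, not taken from the wine dataset, and only the median (the centre line of a boxplot) is computed.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (quality, alcohol) records, NOT taken from the wine dataset.
records = [
    (5, 9.4), (5, 9.8), (5, 10.0),
    (6, 10.5), (6, 11.0),
    (7, 11.8), (7, 12.4), (7, 12.1),
]

# Group the continuous variable by the discrete quality level.
by_quality = defaultdict(list)
for quality, alcohol in records:
    by_quality[quality].append(alcohol)

# One summary per level; a boxplot draws the median as its centre line.
medians = {q: median(values) for q, values in sorted(by_quality.items())}
for q, m in medians.items():
    print(f"quality {q}: n={len(by_quality[q])}, median alcohol={m}")
```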

Bivariate Plots Section

Plots

We will now plot the variables against quality to see how each variable changes with quality. Quality goes on the x axis and each variable, in turn, on the y axis, using boxplots. Because a single function is used to plot all the variables, observations are noted at the end of this section.

However, before plotting, let’s calculate the correlation between the variables and quality.

##        fixed.acidity     volatile.acidity          citric.acid 
##                0.124               -0.391                0.226 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                0.014               -0.129               -0.051 
## total.sulfur.dioxide              density                   pH 
##               -0.185               -0.175               -0.058 
##            sulphates              alcohol 
##                0.251                0.476

None of the variables has a correlation above 0.500 with quality. The highest is alcohol (0.476). We are not going to plot every variable against quality. Instead, we will pick the ones with the strongest correlations and plot those, namely alcohol, volatile.acidity, sulphates and citric.acid.
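
The selection above can be reproduced by ranking the reported coefficients by magnitude. This is a small Python sketch rather than the original code; the values are copied from the correlation output above, and the cut-off `top_n = 4` is an assumption chosen to match the four variables named in the text.

```python
# Correlations with quality, copied from the output above.
cor_with_quality = {
    "fixed.acidity": 0.124,
    "volatile.acidity": -0.391,
    "citric.acid": 0.226,
    "residual.sugar": 0.014,
    "chlorides": -0.129,
    "free.sulfur.dioxide": -0.051,
    "total.sulfur.dioxide": -0.185,
    "density": -0.175,
    "pH": -0.058,
    "sulphates": 0.251,
    "alcohol": 0.476,
}

# Rank by magnitude so that strong negative relationships are not overlooked.
ranked = sorted(cor_with_quality.items(), key=lambda item: abs(item[1]), reverse=True)

top_n = 4  # illustrative cut-off matching the four variables chosen in the text
top_variables = [name for name, _ in ranked[:top_n]]
print(top_variables)
```

Ranking on the absolute value matters here because volatile.acidity, the second strongest relationship, is negative.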

Out of all the variables, alcohol (% by volume) has the strongest correlation with wine quality: 0.476. The lowest quality wines (i.e. with a rating of 3 or 4) have a mean alcohol % below 11%, and the highest quality wines (with a rating of 7 and above) have an alcohol % higher than 11%. However, this increase in alcohol % with increasing quality rating doesn’t hold for wines with a quality rating of 5, where the mean alcohol % is actually lower than for wines with a quality rating of 4.

Volatile acidity has a -0.391 correlation with quality. The plot shows this negative relationship: volatile acidity decreases as quality increases.

Sulphates has a correlation of 0.251 with quality. The plot shows a gradual increase in sulphates as quality increases.

Citric acid has a correlation of 0.226 with quality. The plot shows a gradual increase in citric acid as quality increases.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

volatile.acidity, density, pH, citric.acid, sulphates and alcohol values change as the quality changes.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

##                           X fixed.acidity volatile.acidity citric.acid
## X                     1.000        -0.268           -0.009      -0.154
## fixed.acidity        -0.268         1.000           -0.256       0.672
## volatile.acidity     -0.009        -0.256            1.000      -0.552
## citric.acid          -0.154         0.672           -0.552       1.000
## residual.sugar       -0.031         0.115            0.002       0.144
## chlorides            -0.120         0.094            0.061       0.204
## free.sulfur.dioxide   0.090        -0.154           -0.011      -0.061
## total.sulfur.dioxide -0.118        -0.113            0.076       0.036
## density              -0.368         0.668            0.022       0.365
## pH                    0.136        -0.683            0.235      -0.542
## sulphates            -0.125         0.183           -0.261       0.313
## alcohol               0.245        -0.062           -0.202       0.110
##                      residual.sugar chlorides free.sulfur.dioxide
## X                            -0.031    -0.120               0.090
## fixed.acidity                 0.115     0.094              -0.154
## volatile.acidity              0.002     0.061              -0.011
## citric.acid                   0.144     0.204              -0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201              -0.022
## pH                           -0.086    -0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042    -0.221              -0.069
##                      total.sulfur.dioxide density     pH sulphates alcohol
## X                                  -0.118  -0.368  0.136    -0.125   0.245
## fixed.acidity                      -0.113   0.668 -0.683     0.183  -0.062
## volatile.acidity                    0.076   0.022  0.235    -0.261  -0.202
## citric.acid                         0.036   0.365 -0.542     0.313   0.110
## residual.sugar                      0.203   0.355 -0.086     0.006   0.042
## chlorides                           0.047   0.201 -0.265     0.371  -0.221
## free.sulfur.dioxide                 0.668  -0.022  0.070     0.052  -0.069
## total.sulfur.dioxide                1.000   0.071 -0.066     0.043  -0.206
## density                             0.071   1.000 -0.342     0.149  -0.496
## pH                                 -0.066  -0.342  1.000    -0.197   0.206
## sulphates                           0.043   0.149 -0.197     1.000   0.094
## alcohol                            -0.206  -0.496  0.206     0.094   1.000
  • In terms of correlation between the other variables, the following moderate relationships are observed (absolute correlation > 0.500):
      • pH and fixed.acidity (-0.683)
      • pH and citric.acid (-0.542)
      • fixed.acidity and density (0.668)
      • free.sulfur.dioxide and total.sulfur.dioxide (0.668)
      • fixed.acidity and citric.acid (0.672)
      • volatile.acidity and citric.acid (-0.552)

The strongest relationship is between pH and fixed.acidity (-0.683).
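
The list of moderate relationships can be recovered mechanically from the correlation matrix. The Python sketch below is an illustration, not the original code: the matrix values are copied from the output above (with the ID column X left out), and the 0.5 threshold and variable names such as `strong_pairs` are illustrative choices.

```python
names = [
    "fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
    "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Correlation matrix copied from the output above, without the ID column X.
cor = [
    [1.000, -0.256, 0.672, 0.115, 0.094, -0.154, -0.113, 0.668, -0.683, 0.183, -0.062],
    [-0.256, 1.000, -0.552, 0.002, 0.061, -0.011, 0.076, 0.022, 0.235, -0.261, -0.202],
    [0.672, -0.552, 1.000, 0.144, 0.204, -0.061, 0.036, 0.365, -0.542, 0.313, 0.110],
    [0.115, 0.002, 0.144, 1.000, 0.056, 0.187, 0.203, 0.355, -0.086, 0.006, 0.042],
    [0.094, 0.061, 0.204, 0.056, 1.000, 0.006, 0.047, 0.201, -0.265, 0.371, -0.221],
    [-0.154, -0.011, -0.061, 0.187, 0.006, 1.000, 0.668, -0.022, 0.070, 0.052, -0.069],
    [-0.113, 0.076, 0.036, 0.203, 0.047, 0.668, 1.000, 0.071, -0.066, 0.043, -0.206],
    [0.668, 0.022, 0.365, 0.355, 0.201, -0.022, 0.071, 1.000, -0.342, 0.149, -0.496],
    [-0.683, 0.235, -0.542, -0.086, -0.265, 0.070, -0.066, -0.342, 1.000, -0.197, 0.206],
    [0.183, -0.261, 0.313, 0.006, 0.371, 0.052, 0.043, 0.149, -0.197, 1.000, 0.094],
    [-0.062, -0.202, 0.110, 0.042, -0.221, -0.069, -0.206, -0.496, 0.206, 0.094, 1.000],
]

threshold = 0.5  # illustrative cut-off for a "moderate" relationship

# Visit each unordered pair once (upper triangle) and keep the strong ones.
strong_pairs = [
    (names[i], names[j], cor[i][j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(cor[i][j]) > threshold
]

# Strongest relationship first, judged by magnitude.
strong_pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)
for first, second, r in strong_pairs:
    print(f"{first} ~ {second}: {r:+.3f}")
```

Keeping the sign in the printout makes it clear which of these relationships are negative, for example pH against fixed.acidity.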

Multivariate Plots Section

Plots

We are picking alcohol and volatile.acidity to plot against quality, as they seem to have the strongest relationship with quality compared with the other variables. Alcohol has a mostly positive relationship with quality: wines with a higher alcohol % are rated higher. Volatile acidity has a mostly negative relationship with quality. Putting these together, wines with a higher alcohol % and lower volatile acidity seem to be rated as higher quality more often, and wines with a lower alcohol % and higher volatile acidity seem to be rated as lower quality more often.

Here, we plotted volatile acidity and citric acid colored by quality. As compared to the previous plot (alcohol and volatile acidity), the distinction between low and high quality wines is not as clear here. However, we see that the wines rated higher seem to have higher citric acid and low volatile acidity and the wines rated lower seem to have lower citric acid and higher volatile acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High quality wines tend to have a higher alcohol content and a lower volatile acidity content. Similarly, higher rated wines seem to have a higher citric acid content and a lower volatile acidity content.

Were there any interesting or surprising interactions between features?

No, I didn’t see any worth mentioning.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Quality is the rating of the wines on a scale of 1-10. There are no wines in the dataset rated 1, 2, 9 or 10. A very large share of the wines are of medium quality, i.e. 5 or 6 (more than four fifths). The quality distribution is also slightly right-skewed, with more wines rated 7 or 8 than 3 or 4.

Plot Two

Description Two

Out of all the variables, alcohol (% by volume) has the strongest correlation with wine quality: 0.476. The lowest quality wines (i.e. with a rating of 3 or 4) have a mean alcohol % below 11%, and the highest quality wines (with a rating of 7 and above) have an alcohol % higher than 11%. However, this increase in alcohol % with increasing quality rating doesn’t hold for wines with a quality rating of 5, where the mean alcohol % is actually lower than for wines with a quality rating of 4.

Plot Three

Description Three

We are picking alcohol and volatile.acidity to plot against quality, as they seem to have the strongest relationship with quality compared with the other variables. Alcohol has a mostly positive relationship with quality: wines with a higher alcohol % are rated higher. Volatile acidity has a mostly negative relationship with quality. Putting these together, wines with a higher alcohol % and lower volatile acidity seem to be rated as higher quality more often, and wines with a lower alcohol % and higher volatile acidity seem to be rated as lower quality more often.


Reflection

The dataset comprises data for 1599 red wines, rated on a quality scale of 1-10. Every record has, apart from the quality rating, 11 variables describing various chemical attributes of that wine. Our goal in this analysis was to find the relationships between these chemical attributes and the quality rating of the wines.

I began by examining the variables independently, looking at their ranges and distributions. I also converted quality into a factor variable to help with the classification. After that, I calculated the correlations between each of the independent variables and quality. We didn't find very strong relationships, although two variables did stand out: alcohol and volatile acidity. Then we plotted the variables against quality and discovered that alcohol, volatile acidity, sulphates and citric acid seem to show some relationship with quality, although not a very strong one. I chose boxplots for the bivariate data, which helped me see the distributions. I then picked the variables with the highest correlation values with quality and plotted them together with quality. Here, we see some relationship emerging between alcohol, volatile acidity and quality.

Earlier, I tried plotting variables against quality using a scatter plot without converting quality into a factor. That didn't produce any clear insights. After converting quality into a factor variable and using boxplots, certain distributions became quite clear.

To conclude, there are a lot of ways this analysis can be improved. Data about more wines, especially low and high quality ones, would help. More chemical attributes may need to be recorded. Applying some kind of machine learning algorithm would also help.